[cli] Add mosaic command-line tool #66

Open
jianguotian wants to merge 27 commits into apache:main from jianguotian:feat/mosaic-cli

@jianguotian jianguotian commented Jun 16, 2026

Contributor

Inspecting a Mosaic file used to mean writing Rust against the library API. This adds a mosaic binary to inspect, query and import files from the shell, with a command surface aligned to parquet-cli so the workflow is familiar.

Commands (the inspection and query commands support --json; convert writes a file)

| Command | Shows |
| --- | --- |
| schema | column names, Arrow types, nullability, bucket assignment |
| meta | row groups, row counts, per-column stats (null_count/min/max) |
| cat / head | rows as a table; -n, --all, -c/--columns, --where (stats pushdown) |
| count | total row count |
| convert | import CSV or JSON lines into a new file (schema inferred, --stats) |
| pages | per-column encoding (plain/const/dict/all_null) + slot size |
| footer | magic, version, bucket count, compression |
| column-size | on-disk bytes per column + total compression ratio |
| dictionary | dump a dict-encoded column (-c/--column) |
| buckets | bucket layout per row group (monolithic vs paged) + ratio |

mosaic convert data.csv -o data.mosaic --stats id
mosaic cat data.mosaic --all --where "id>100"   # skips row groups via min/max
mosaic count data.mosaic

Structure

  • New cli workspace crate: clap commands, text/JSON renderers, where filter + stats pushdown, CSV/JSON import; reads via a file-backed InputFile (pread)
  • Core: read-only additive accessors only — no format or behavior change; convert uses the existing MosaicWriter
  • Docs: docs/cli.html and cli/README.md cover every command with text/JSON examples

Tests

221 pass, 0 fail — cli e2e covers all commands incl. compression ratio, where filter, stats pushdown boundaries and CSV/JSON round-trip; core unchanged (126 + 53 + 21)

mingfeng and others added 3 commits June 16, 2026 04:22
Mosaic previously shipped no viewer tooling — inspecting a file meant
writing Rust against the library API. Add a `mosaic` binary (a new `cli`
workspace crate) mirroring parquet-cli:

- schema: column names, Arrow types, nullability, bucket assignment
- meta:   row groups, rows, per-column stats (null_count/min/max)
- cat:    first N rows as a table, with -n and --columns projection
- pages:  per-column encoding (plain/const/dict/all_null) + slot size

All commands support --json. The reader is driven over a new file-backed
InputFile (pread). Core gains three small read-only accessors used by
`pages`: BucketReader::encodings(), ColumnPageReader::encoding(), and
MosaicReader::page_infos(). No format/behavior change; 199 core tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add a core regression test for MosaicReader::page_infos asserting
plain/dict/const detection on a paged-bucket file, and CLI unit tests
for the fmt helpers (json escaping, value/encoding rendering, ndjson
null handling, table truncation).

Drive the mosaic binary against a fixture file (via CARGO_BIN_EXE) and
assert stdout for schema/meta/pages/cat, --json output, projection,
row truncation and missing-file failure. No external dev-deps.
@JingsongLi

Contributor

Can you also add documentation? And can you compare the API design differences with Parquet CLI?

Adds docs/cli.html documenting the mosaic inspector (schema/meta/pages/cat,
text + JSON) with a parquet-cli command mapping and design-difference table,
addressing the review asks on apache#66. Adds CLI to the nav across doc pages.
mingfeng added 8 commits June 22, 2026 03:26
Align the viewer command set with parquet-cli/arrow-rs: head (alias of
cat), footer (magic/version/buckets/compression), column-size (on-disk
bytes per column), dictionary (dump dict-encoded entries). Core gains
compression()/dict_values()/dictionary() read-only accessors. e2e tests
cover the new commands.

Mosaic's column-bucket grouping has no parquet equivalent. Add a
buckets command printing, per row group, each bucket's kind
(empty/monolithic/paged), on-disk size and member columns. Core gains
MosaicReader::bucket_infos(). e2e covered.

Align dictionary column selection with parquet-cli's -c flag instead of
a positional argument; update e2e.

Completes JSON output across all 9 commands; dict columns emit an array,
non-dict row groups emit null. e2e extended.

Expand docs/cli.html and cli/README.md to cover every command
(schema/meta/footer/buckets/pages/dictionary/column-size/cat/head) with
usage and example output. Drop all comparison content per maintainer
preference.

Remove the near-trivial encoding_names mapping test; extend footer and
buckets e2e to cover their --json output, improving CLI feature coverage.

The e2e tests carry their own fixture writer; the standalone gen.rs
example duplicated it and was unreferenced.
@jianguotian jianguotian changed the title feat(cli): add mosaic inspector CLI (schema/meta/cat/pages) feat(cli): add mosaic inspector CLI (9 commands, text + JSON) Jun 22, 2026
@jianguotian jianguotian changed the title feat(cli): add mosaic inspector CLI (9 commands, text + JSON) feat(cli): add mosaic inspector CLI Jun 22, 2026
@jianguotian jianguotian changed the title feat(cli): add mosaic inspector CLI feat(cli): add mosaic inspector CLI Jun 22, 2026
mingfeng added 5 commits June 22, 2026 08:15
column-size summed PageInfo.slot_size, which is 0 for monolithic buckets
(the default for small files), so every column reported 0 B. Attribute each
bucket's on-disk size to its columns via bucket_infos instead; add a
default-threshold e2e so monolithic files are covered. Also emit Date32 as a
bare epoch-day integer in cat --json for consistency with other numerics.

BucketInfo now carries uncompressed size (exact for monolithic, unknown for
paged). column-size prints a total with uncompressed bytes + ratio; buckets
shows per-bucket ratio when known. Tests + docs updated.

count prints total rows; --all drops the -n cap; --where applies one
column/op/value condition (=,!=,>,>=,<,<=), numeric or string. Tests + docs.

convert imports a CSV into a new Mosaic file (schema inferred, --stats picks
min/max columns). cat --where now skips row groups whose min/max provably
exclude the predicate, conservative (any missing stat keeps the group).
Tests + docs.

convert dispatches on extension: .json/.ndjson/.jsonl read one object per line
via arrow-json (CSV path unchanged). Test + docs.
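
The conservative min/max pushdown described in the commits above can be sketched as follows. This is an illustration only: `Op`, `Stats` and `can_skip` are hypothetical names, the statistics are simplified to `i64`, and none of this is taken from the PR's code. A row group is skipped only when the statistics prove that no row can match; if either bound is missing, the group is kept.

```rust
/// Comparison operators of a single `column op value` condition (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Op {
    Eq,
    Ne,
    Gt,
    Ge,
    Lt,
    Le,
}

/// Per-row-group statistics for one column; either bound may be missing.
#[derive(Clone, Copy, Debug)]
pub struct Stats {
    pub min: Option<i64>,
    pub max: Option<i64>,
}

/// Returns true only when min/max prove that no row satisfies `col <op> value`.
pub fn can_skip(stats: Stats, op: Op, value: i64) -> bool {
    let (min, max) = match (stats.min, stats.max) {
        (Some(min), Some(max)) => (min, max),
        // Conservative: a missing statistic keeps the row group.
        _ => return false,
    };
    match op {
        Op::Eq => value < min || value > max,
        // Every non-null value equals `value`, so `!=` can never match.
        Op::Ne => min == max && min == value,
        Op::Gt => max <= value,
        Op::Ge => max < value,
        Op::Lt => min >= value,
        Op::Le => min > value,
    }
}
```

With this sketch, a condition such as `id>100` skips a row group whose `max` is 100 but still reads one whose `max` is 101.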
@jianguotian jianguotian changed the title feat(cli): add mosaic inspector CLI feat(cli): add mosaic command-line inspector (11 commands) Jun 23, 2026
mingfeng added 2 commits June 22, 2026 21:36
… atomicity

- page_infos validates paged slot sizes sum to total (rejects forged sizes
  that could drive a ~4GiB read, matching the projected reader)
- cat --where: ordering ops on a non-numeric column/value error instead of
  silently dropping all rows
- cat --json: NaN/Infinity emit null (was invalid JSON)
- convert writes a temp file and renames on success (no truncated output)
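
The temp-file-then-rename pattern from the last bullet can be sketched with the standard library alone. This is a minimal sketch under assumptions: `write_atomically` and the `.tmp.<pid>` naming are hypothetical and not taken from the PR. The temp file sits next to the destination so the rename stays on one filesystem, and a failed write removes it without touching any existing output.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Writes the temp file and syncs it to disk before it is renamed into place.
fn fill_temp<F>(tmp: &Path, fill: F) -> io::Result<()>
where
    F: FnOnce(&mut fs::File) -> io::Result<()>,
{
    let mut file = fs::File::create(tmp)?;
    fill(&mut file)?;
    file.sync_all()
}

/// Runs `fill` against a sibling temp file and renames it to `dest` on success.
/// On failure the temp file is removed and `dest` is left as it was.
pub fn write_atomically<F>(dest: &Path, fill: F) -> io::Result<()>
where
    F: FnOnce(&mut fs::File) -> io::Result<()>,
{
    let mut tmp_name = dest
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "destination has no file name"))?
        .to_os_string();
    // Hypothetical naming scheme: "<name>.tmp.<pid>" in the same directory.
    tmp_name.push(format!(".tmp.{}", std::process::id()));
    let tmp = dest.with_file_name(tmp_name);

    match fill_temp(&tmp, fill) {
        Ok(()) => fs::rename(&tmp, dest),
        Err(err) => {
            // Best-effort cleanup; the original error is what the caller sees.
            let _ = fs::remove_file(&tmp);
            Err(err)
        }
    }
}
```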
@jianguotian jianguotian changed the title feat(cli): add mosaic command-line inspector (11 commands) [cli] Add mosaic command-line tool Jun 23, 2026

@JingsongLi JingsongLi left a comment

Contributor

I left a few focused comments on CLI semantics and Parquet CLI alignment.

Comment thread cli/src/fmt.rs Outdated
Float32 if !arr.as_any().downcast_ref::<Float32Array>().unwrap().value(row).is_finite() => "null".into(),
Float64 if !arr.as_any().downcast_ref::<Float64Array>().unwrap().value(row).is_finite() => "null".into(),
// Date32 is an epoch-day integer; emit bare like other numerics.
_ => cell(arr, row),

Contributor

cat --json can emit invalid JSON for Mosaic types that are supported by the core reader but not rendered by cell(). For example Binary, Decimal, Time32, Timestamp, List, and Map currently fall through to ?, and this branch writes that value unquoted, producing output like "col":?. Parquet/Arrow CLI paths use a real JSON serializer or type-aware value rendering; please either route this through Arrow JSON writing or cover all supported Mosaic types explicitly.

Contributor Author

Fixed, cat --json now goes through Arrow's JSON writer, so all reader types render as valid JSON.
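
The fix in the PR delegates to Arrow's JSON writer. To illustrate why type-aware rendering matters, here is a small std-only sketch that does not use Arrow; `Cell`, `to_json` and `quote` are hypothetical names. Every value is either a valid JSON number, a quoted and escaped string, or `null`, so nothing like a bare `?` can reach the output, and NaN/Infinity become `null`.

```rust
/// A simplified cell value (hypothetical; the real reader handles many more types).
pub enum Cell {
    Null,
    Int(i64),
    Float(f64),
    Text(String),
}

/// Renders one cell as a valid JSON value.
pub fn to_json(cell: &Cell) -> String {
    match cell {
        Cell::Null => "null".to_string(),
        Cell::Int(i) => i.to_string(),
        Cell::Float(f) if f.is_finite() => f.to_string(),
        // JSON has no NaN or Infinity literals, so non-finite floats become null.
        Cell::Float(_) => "null".to_string(),
        Cell::Text(s) => quote(s),
    }
}

/// Quotes a string and escapes the characters JSON requires to be escaped.
fn quote(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}
```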

Comment thread cli/src/fmt.rs Outdated
let mask: Vec<bool> = (0..batch.num_rows()).map(|r| {
if col.is_null(r) { return false; }
let lhs = cell(col.as_ref(), r);
match (lhs.parse::<f64>(), w.value.parse::<f64>()) {

Contributor

This makes equality semantics depend on whether the rendered strings happen to parse as numbers. On a Utf8 column, --where s=1 matches both "1" and "01" because both sides parse as f64. Since Parquet CLI does not define this filter behavior, Mosaic should make the semantics type-driven: exact string comparison for Utf8, numeric comparison only for numeric columns.

Contributor Author

Fixed, comparison is type-driven now: numeric columns compare numerically, others as exact strings. s=01 no longer matches 1.
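
A minimal sketch of what type-driven equality could look like, using plain slices in place of Arrow arrays. `Column` and `eq_mask` are hypothetical names, not the PR's API. The column type, not the shape of the literal, decides how the comparison is done: integer columns parse the literal once and compare numerically, while string columns compare the raw text exactly.

```rust
/// A simplified column (hypothetical stand-in for an Arrow array).
pub enum Column<'a> {
    Int(&'a [i64]),
    Utf8(&'a [&'a str]),
}

/// Builds a row mask for `column = literal`, with semantics chosen by column type.
pub fn eq_mask(col: &Column<'_>, literal: &str) -> Result<Vec<bool>, String> {
    match col {
        Column::Int(values) => {
            // Numeric column: the literal must parse, otherwise it is an error
            // rather than a silent "no rows".
            let rhs: i64 = literal
                .parse()
                .map_err(|_| format!("'{literal}' is not an integer"))?;
            Ok(values.iter().map(|v| *v == rhs).collect())
        }
        // String column: exact comparison, so "01" never matches "1".
        Column::Utf8(values) => Ok(values.iter().map(|v| *v == literal).collect()),
    }
}
```

Under these assumptions, `s=1` on a string column holding `"1"` and `"01"` matches only the first row, while the same literal on an integer column still compares by value.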

Comment thread cli/src/main.rs
// paged layouts); split a bucket's size across its member columns.
for b in reader.bucket_infos(rg)? {
if b.columns.is_empty() { continue; }
split_evenly(b.size, &b.columns, &mut bytes);

Contributor

This differs materially from Parquet CLI column-size, which sums exact column-chunk sizes from metadata. Here the bucket size is split evenly across member columns, so the per-column bytes can be quite misleading; for paged buckets we should be able to use per-column slot sizes instead. If monolithic bucket attribution must remain approximate, please label it as approximate in the command output/docs rather than presenting it as exact per-column size.

Contributor Author

Fixed, paged buckets use exact per-column slot sizes; monolithic multi-column buckets are split and labeled (approx).
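
The attribution rule agreed on in this thread can be sketched as below. The types are hypothetical stand-ins (`Bucket`, `ColumnSize`, `column_sizes`), not the PR's `BucketInfo` API. Paged buckets contribute exact per-column slot sizes; a monolithic bucket's size is split across its columns, and the result is flagged approximate whenever the bucket holds more than one column.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the bucket metadata the CLI reads.
pub enum Bucket {
    /// Per-column slot sizes are known exactly.
    Paged { slots: Vec<(String, u64)> },
    /// Only the total on-disk size is known.
    Monolithic { size: u64, columns: Vec<String> },
}

#[derive(Debug, Default, PartialEq)]
pub struct ColumnSize {
    pub bytes: u64,
    /// True when any contribution came from splitting a shared bucket.
    pub approx: bool,
}

pub fn column_sizes(buckets: &[Bucket]) -> BTreeMap<String, ColumnSize> {
    let mut out: BTreeMap<String, ColumnSize> = BTreeMap::new();
    for bucket in buckets {
        match bucket {
            Bucket::Paged { slots } => {
                for (name, size) in slots {
                    out.entry(name.clone()).or_default().bytes += *size;
                }
            }
            Bucket::Monolithic { size, columns } => {
                if columns.is_empty() {
                    continue;
                }
                let n = columns.len() as u64;
                let (share, remainder) = (*size / n, *size % n);
                for (i, name) in columns.iter().enumerate() {
                    let entry = out.entry(name.clone()).or_default();
                    // Hand the remainder to the first columns so the shares sum to `size`.
                    entry.bytes += share + u64::from((i as u64) < remainder);
                    entry.approx |= columns.len() > 1;
                }
            }
        }
    }
    out
}
```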

Comment thread cli/README.md Outdated

## Commands

Every command accepts `--json`.

Contributor

convert does not accept --json, so this statement is currently inaccurate. Please scope this to the inspection/query commands or add a JSON mode to convert.

Contributor Author

Fixed, scoped the statement to inspection/query commands; convert writes a file.

mingfeng added 2 commits June 23, 2026 00:44
…tial

Paged buckets don't record uncompressed size, so summing across a mixed file
gave a misleading total ratio (e.g. 0.01x). Only report the total ratio when
every non-empty bucket's uncompressed size is known.
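
A sketch of the "all or nothing" total ratio, under two assumptions that are not confirmed by the PR: a bucket counts as non-empty when its on-disk size is greater than zero, and the ratio is uncompressed bytes divided by on-disk bytes. `BucketSize` and `total_ratio` are hypothetical names.

```rust
/// Hypothetical per-bucket size record; `uncompressed` is None when unknown (paged).
pub struct BucketSize {
    pub on_disk: u64,
    pub uncompressed: Option<u64>,
}

/// Returns the total ratio only if every non-empty bucket reports its
/// uncompressed size; a single unknown makes the total unknown.
pub fn total_ratio(buckets: &[BucketSize]) -> Option<f64> {
    let mut disk: u64 = 0;
    let mut raw: u64 = 0;
    for b in buckets.iter().filter(|b| b.on_disk > 0) {
        disk = disk.checked_add(b.on_disk)?;
        // `?` on the Option bails out as soon as one size is unknown.
        raw = raw.checked_add(b.uncompressed?)?;
    }
    if disk == 0 {
        None
    } else {
        Some(raw as f64 / disk as f64)
    }
}
```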
… column-size

- cat --json renders via Arrow JSON writer: all reader types valid, NaN->null
- where: numeric cols compare numerically, others exact string; ordering numeric-only
- column-size: paged uses exact slot sizes, monolithic multi-col marked (approx)
- README: scope --json to inspection/query commands (convert writes a file)
mingfeng added 4 commits June 23, 2026 04:50
…dges

- Cat/Head: single variant with visible_alias=head (no duplicated fields)
- stats_exclude reads Value numerically (to_f64) — Date/Time pushdown no longer
  silently disabled by the (epoch-day) string suffix
- pretty_table/ndjson guard empty input; cell shows <Type> not ? for unhandled
- page_infos slot-sum uses checked_add; docs --json scope fixed
- where and stats_exclude compare integers in i128 (full i64 range exact);
  large ids (Snowflake) no longer mis-skip or mis-match via lossy f64
- ndjson returns io::Result instead of .expect, keeping the CLI error contract
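
Why comparing through `f64` mis-matches large ids can be shown with a small sketch. `cmp_lossy` and `cmp_exact` are hypothetical helpers, not functions from the PR. Above 2^53 an `f64` cannot represent every integer, so two different ids may compare equal; widening to `i128` keeps the full `i64` range exact and, as an assumption about the motivation, also leaves room for `u64` values.

```rust
use std::cmp::Ordering;

/// Lossy comparison: both sides go through f64, which has a 53-bit mantissa.
pub fn cmp_lossy(value: i64, literal: &str) -> Option<Ordering> {
    let rhs: f64 = literal.parse().ok()?;
    (value as f64).partial_cmp(&rhs)
}

/// Exact comparison: both sides are widened to i128 before comparing.
pub fn cmp_exact(value: i128, literal: &str) -> Option<Ordering> {
    let rhs: i128 = literal.parse().ok()?;
    Some(value.cmp(&rhs))
}
```

For example, the id 9007199254740993 (2^53 + 1) and the literal `9007199254740992` compare as equal through `cmp_lossy` but as greater through `cmp_exact`.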
Replace arrow-array/schema/select/csv/json with arrow = {features=csv,json};
one version to bump, paths via arrow::{array,datatypes,error,compute,csv,json}.
@jianguotian

Contributor Author

> Can you also add documentation? And can you compare the API design differences with Parquet CLI?

Docs added — docs/cli.html and cli/README.md.

Compared with parquet-cli:

  1. Most commands will be familiar from parquet-cli (schema/meta/cat/count/convert,
    -c/--json/--where).
  2. The new part is Mosaic's buckets: the buckets command shows the monolithic/paged
    layout, and schema/pages show which bucket each column lands in:
$ mosaic buckets data.mosaic
    bucket 0: monolithic 15B [flag]
    bucket 1: paged 353B [id]
    bucket 2: paged 32B [kind]

$ mosaic schema data.mosaic
  id:   Int32 [bucket 1]
  flag: Int32 [bucket 0]
  3. It's a single native binary, no JVM.

mingfeng added 2 commits June 23, 2026 06:05
Read the filter column even when projected out, then drop it before printing,
so cat -c id,name --where score>440 filters instead of erroring.
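
The projection fix in this commit can be sketched as a small planning step. `columns_to_read` is a hypothetical helper, not the PR's code. It returns the columns the reader must load, plus the filter column to drop before printing when the user did not ask for it.

```rust
/// Plans which columns to read for `cat -c <projection> --where <filter_column> ...`.
/// Returns (columns to read, column to drop before printing if it was added).
pub fn columns_to_read(
    projection: &[&str],
    filter_column: Option<&str>,
) -> (Vec<String>, Option<String>) {
    let mut read: Vec<String> = projection.iter().map(|c| c.to_string()).collect();
    let mut drop_after = None;
    if let Some(col) = filter_column {
        if !projection.contains(&col) {
            // The filter needs this column even though it is not printed.
            read.push(col.to_string());
            drop_after = Some(col.to_string());
        }
    }
    (read, drop_after)
}
```

For `cat -c id,name --where score>440`, this plan reads `id`, `name` and `score`, applies the filter, and drops `score` from the printed output.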